Apparatus and method for coding speech signals by making use of an adaptive codebook

ABSTRACT

An audio signal encoding device is provided including an input for receiving a sub-frame of an audio signal to be encoded, an adaptive codebook and a processing unit. The adaptive codebook stores at least one prior knowledge entry which includes a data element representative of characteristics of at least a portion of a previously generated audio signal sub-frame. The processing unit generates a set of parameters allowing synthesis of the audio signal sub-frame received at the input, on the basis of at least the sub-frame of the audio signal received at the input and the data element stored in the adaptive codebook. A corresponding decoding device for synthesizing an audio signal on the basis of a set of parameters is also provided.

This is a divisional of prior application Ser. No. 09/107,385, filed Jun. 30, 1998.

FIELD OF THE INVENTION

This invention relates to the field of processing audio signals, such as speech signals that are compressed or encoded with a digital signal processing technique. More specifically, the invention relates to an improved method and an apparatus for coding speech signals that can be particularly useful in the field of wireless communications.

BACKGROUND OF THE INVENTION

In communication applications where channel bandwidth is at a premium, it is essential to use the smallest possible portion of a transmission channel in order to transmit a voice signal. A common solution is to process the voice signal with an apparatus called a speech codec before it is transmitted on an RF channel.

Speech codecs, including an encoding and a decoding stage, are used to compress (and decompress) the digital signals at the source and reception point, respectively, in order to optimize the use of transmission channels. By encoding only the necessary characteristics of a speech signal, fewer bits need to be transmitted than would be required to reproduce the original waveform, in a manner that does not significantly degrade the speech quality. With fewer bits required, lower bit rate transmission can be achieved.

Most state-of-the-art codecs are based on the original CELP model proposed by Schroeder and Atal in “Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates,” Proceedings of ICASSP, pp. 937-940, 1985. This document is hereby incorporated by reference. This basic codec model has been improved in many aspects to achieve bit rates of approximately 8 kbits/sec and even lower, but voice quality in those with lower bit rates may not be acceptable for telephony applications. An example of an 8 kbits/sec codec is fully described in version 5.0 of the International Telecommunication Union Telecommunications Standardization Sector (ITU-TSS) Draft Recommendation G.729, “Coding of speech at 8 kbits/s using Conjugate-Structure Algebraic-Code-Excited Linear-Predictive (CS-ACELP) coding”, dated Jun. 8, 1995. This document is hereby incorporated by reference.

Considering that lower bit rates at acceptable speech quality levels provide great economical advantages, there exists a need in the industry to provide an improved speech coding apparatus and method particularly well suited for telecommunications applications.

OBJECTIVES AND SUMMARY OF THE INVENTION

A general object of the invention is to provide an improved audio signal coding device, such as a Linear Predictive (LP) encoder, that achieves audio coding at low bit rates while maintaining audio quality at a level acceptable for communication applications.

In this specification, the term “filter coefficients” is intended to refer to any set of coefficients that uniquely defines a filter function that models the spectral characteristics of an audio signal. In conventional audio signal encoders, several different types of coefficients are known, including linear prediction coefficients, reflection coefficients, arcsines of the reflection coefficients, line spectrum pairs, log area ratios, among others. These different types of coefficients are usually related by mathematical transformations and have different properties that suit them to different applications. Thus, the term “filter coefficients” is intended to encompass any of these types of coefficients.

In this specification, the term “excitation segment” is defined as information that needs to be combined with the filter coefficients in order to provide a complete representation of the audio signal. Such excitation segment may include parametric information describing the periodicity of the speech signal, a residual (often referred to as “excitation signal”) as computed by the encoder of a vocoder, speech framing control information to ensure synchronous framing in the decoder associated with the remote vocoder, pitch periods, pitch lags, gains and relative gains, among others.

In this specification, the term “sample” refers to the amplitude value at one specific instant in time of a signal. PCM (Pulse Code Modulation) is a form of coding of an analog signal that produces a plurality of samples, each sample representing the amplitude of the waveform at a certain time.

The term “audio signal subframe” refers to a set of samples that represent a portion of an audio signal such as speech. For example, in an embodiment of this invention, subframes of 40 samples were used. Also, “audio signal frames” are defined as a plurality of sample sets, each set being representative of a sub-frame. In a specific example, an audio signal frame has four sub-frames.

In a most preferred embodiment, the audio signal encoding device encodes an audio signal, such as a speech signal, differently in dependence upon the voiced/unvoiced characteristics of the signal. In a most preferred embodiment, the audio signal encoding device includes two signal synthesis stages, one better suited for unvoiced signals and one better suited for voiced signals. In operation, each signal synthesis stage generates a synthesized speech signal based on a set of parameters, such as filter coefficients and an excitation segment, computed to best approximate the input speech signal sub-frame. The two synthesized signals are compared and the one that manifests less error with respect to the input speech signal is selected as being the best match, and the parameters previously computed for this synthesized signal are the ones used to form the compressed or encoded audio signal sub-frame.
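By way of illustration only, the following minimal Python sketch shows this compare-and-select logic; the stage interfaces (voiced_stage, unvoiced_stage) and the plain squared-error measure are assumptions made for the sketch, not the encoder's actual perceptually weighted criterion:

```python
import numpy as np

def encode_subframe(subframe, voiced_stage, unvoiced_stage):
    """Run both synthesis stages and keep the parameter set whose
    synthesized sub-frame shows the smaller error against the input."""
    v_params, v_synth = voiced_stage(subframe)    # voiced-model candidate
    u_params, u_synth = unvoiced_stage(subframe)  # unvoiced-model candidate
    v_err = np.sum((subframe - v_synth) ** 2)     # squared error (unweighted)
    u_err = np.sum((subframe - u_synth) ** 2)
    if v_err <= u_err:
        return v_params, "voiced"
    return u_params, "unvoiced"
```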

The major difference between the signals produced by the voiced signal synthesis stage and the unvoiced signal synthesis stage resides in the periodicity or pitch of the signals. The synthesized voiced signal manifests a higher periodicity than the synthesized unvoiced signal.

In a specific example, the voiced signal synthesis stage comprises an adaptive codebook containing prior knowledge entries that are past audio signal sub-frames. The output of this codebook provides the periodic component of the signal generated by the voiced signal synthesis stage. Selecting an entry from a pulse stochastic codebook and passing this entry into a synthesis filter produces the aperiodic component.

The unvoiced signal synthesis stage comprises a noise stochastic codebook that issues a sample noise signal used as input to a synthesis filter. The output of the synthesis filter is the synthetic unvoiced audio signal.

In accordance with a broad aspect, the invention provides an audio signal encoding device, including an input for receiving a sub-frame of an audio signal to be encoded, an adaptive codebook and a processing unit. The adaptive codebook stores at least one prior knowledge entry, the prior knowledge entry including a data element representative of characteristics of at least a portion of a previously synthesized audio signal sub-frame. The processing unit is in operative relationship with the input and with the adaptive codebook and generates a set of parameters allowing generation of a certain synthesized audio signal sub-frame, on the basis of at least the sub-frame of the audio signal received at the input and the data element in the adaptive codebook.

In accordance with another broad aspect, the invention provides an audio signal decoding device for synthesizing a certain audio signal sub-frame from a set of parameters derived from an original audio signal sub-frame. The audio signal decoding device includes an input for receiving the set of parameters derived from the original audio signal sub-frame, an adaptive codebook and a processing unit. The adaptive codebook stores at least one prior knowledge entry including a data element representative of characteristics of at least a portion of a previously synthesized audio signal sub-frame synthesized by the audio signal decoding device. The processing unit is in operative relationship with the input and with the adaptive codebook and synthesizes the certain audio signal sub-frame on the basis of at least the set of parameters received at the input and the data element in the adaptive codebook.

In accordance with another broad aspect, the invention provides a method for synthesizing a certain audio signal sub-frame from a set of parameters derived from an original audio signal sub-frame. The set of parameters derived from the original audio signal sub-frame is received. An adaptive codebook in which is stored at least one prior knowledge entry is provided, where the prior knowledge entry includes a data element representative of characteristics of at least a portion of a previously synthesized audio signal sub-frame. The certain audio signal sub-frame is synthesized on the basis of at least the set of parameters received and the data element in the adaptive codebook.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating the concept of the audio signal encoding and decoding process that takes place in a telecommunication system or any other environment where audio signals in encoded or compressed form are being transmitted;

FIG. 2 is a block diagram showing a prior art audio signal encoder;

FIG. 3 is a block diagram of an audio signal encoder constructed in accordance with the present invention;

FIG. 4 is a block diagram of a signal processing device built in accordance with an embodiment of the invention and that can be used to implement the function of the encoder described in FIG. 3;

FIG. 5 is a block diagram of an apparatus for smoothing sub-frames according to an embodiment of the present invention; and

FIG. 6 is a block diagram of an apparatus for smoothing sub-frames in accordance with a variant.

DESCRIPTION OF A PREFERRED EMBODIMENT

A prior art speech encoder/decoder combination is depicted in FIG. 1. A PCM (Pulse Coded Modulation) speech signal 100 is input to a CELP (Code Excited Linear Prediction) encoder 120 that processes the audio signal provided and produces a representation of the signal in a compressed form. A single sub-frame of this signal in encoded form is represented by a set of parameters comprising filter coefficients and an excitation segment. The signal sub-frame is transported over a communication channel 105, which carries it to a CELP decoder 130. The signal sub-frame is processed by the decoder 130 that uses the filter coefficients and the excitation segment to synthesize the audio signal.

CELP encoders are presently the most common type of encoders used in telephony. CELP encoders send index information that points to a set of vectors in adaptive and stochastic codebooks. That is, for each speech signal sub-frame, the encoder searches through its codebook(s) for the entry that gives the best perceptual match to the speech input when used as an excitation to the LPC synthesis filter.

FIG. 2 is a block diagram of a prior art CELP encoder. It can be noted that in this version of encoder 120 is provided an arrangement of sub-components that are an exact replica of a speech decoder, such as 130, that could be used to return the compressed speech to the PCM form. Box 290 illustrates these sub-components.

The encoder has an input that receives successive sub-frames of the PCM audio signal, such as speech signal 201. A signal sub-frame is input to an LPC analysis block 200 and to the adder 202. The LPC analysis block 200 outputs the LPC filter coefficients 204 for this sub-frame for transmission on the communication channel 105, as an input to an LPC synthesis filter 205, and as an input to a perceptual weighting filter 225. At the adder 202, the output 256 of the LPC synthesis filter 205 is subtracted from the PCM speech signal 201 to produce an error signal 257. The error signal 257 is sent to a perceptual weighting filter 225 followed by an error minimization processor 227 that outputs the pitch gain value 233, the lag value 232, the codebook index 234, and the stochastic gain value 235 that are transmitted over the communication channel 105.

The error minimization processor 227 compares the error signal output from the perceptual weighting filter 225 and, when the smallest error signal is achieved for a speech subframe, it signals the encoder 120 to send the compressed speech data for this speech subframe on communication channel 105. In this example, the compressed speech data includes the filter coefficients 204, the pitch gain value 233, the lag value 232, the codebook index 234, and the stochastic gain value 235. In order to achieve the smallest error for a speech subframe, the error minimization processor 227 sequentially generates new pitch gain and lag values and stochastic codebook indexes. Those new values are processed through a feedback loop to produce a new synthetic audio signal sub-frame that is again compared to the actual signal 201 sub-frame. When a minimal error is reached, the filter coefficients and the excitation subframe computed to produce such minimal error are released for transport over the communication channel 105.

More specifically, the lag value 232 is also sent back to the adaptive codebook 215 to effect a backward adaptation procedure, and thus select the best waveform from the adaptive codebook 215 to match the input speech signal 201. The adaptive codebook 215 outputs the periodic component of the speech signal to the multiplier 237, where multiplication with the pitch gain 233 is effected and whose output is sent to the adder 212.

The code index 234, for its part, is also fed back to the stochastic codebook 220. The stochastic codebook 220 outputs the aperiodic component of the speech signal to the multiplier 242, where multiplication with the stochastic gain 235 is effected and whose output is sent to the adder 212.

At adder 212, the output of the multiplier 237 is added to the output of the multiplier 242 to form the complete excitation 254. The excitation 254 is fed back to the adaptive codebook 215 so that it may update its entries. The excitation 254 is also filtered by the LPC synthesis filter 205 to produce a reconstructed speech signal 256. The reconstructed speech signal 256 is fed to the adder 202.

The representation of the transfer function of a CELP codec as described in FIG. 2 is given by:

i(n) = [g_p a(n−L) + g_pl b(n)] ⊗ h_i(n) + e(n)

where i(n), n = 1, . . . , N is the input sequence to be approximated;

a(n−L) is the ACB sequence selected;

g_p is the pitch gain parameter adjusted to maximize the pitch prediction gain;

b(n) is a sparse impulse sequence (unit energy) taken from the SCB;

g_pl is a pulse gain parameter;

h_i(n) is the impulse response of an all-pole LPC synthesis filter derived from the input signal;

e(n) is an error sequence to be minimized (after perceptual weighting); and

⊗ represents discrete convolution.
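As a rough illustration of this model, the following Python sketch (all names are hypothetical; the codebook contents, gains and LPC coefficients are stand-ins) forms the combined excitation and passes it through the all-pole synthesis filter, which is equivalent to convolving with its impulse response h_i(n):

```python
import numpy as np

def celp_synthesize(acb, lag, g_p, pulse_vec, g_pl, lpc_a, n=40):
    """Sketch of i(n) ~ [g_p*a(n-L) + g_pl*b(n)] convolved with h_i(n).

    acb       -- buffer of past excitation samples (adaptive codebook)
    lag       -- pitch lag L in samples
    pulse_vec -- sparse impulse sequence b(n) from the stochastic codebook
    lpc_a     -- coefficients [a_1..a_p] of the all-pole filter 1/A(z)
    """
    start = len(acb) - lag
    # ACB sequence a(n-L); repeated periodically when lag < n.
    a = np.array([acb[start + (i % lag)] for i in range(n)])
    x = g_p * a + g_pl * pulse_vec                 # combined excitation
    y = np.zeros(n)
    for i in range(n):                             # all-pole (IIR) filtering
        fb = sum(lpc_a[k] * y[i - 1 - k]
                 for k in range(len(lpc_a)) if i - 1 - k >= 0)
        y[i] = x[i] + fb
    return y
```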

FIG. 3 provides a block diagram of an audio signal encoder in accordance with an embodiment of the invention. It can be noted that in this version of encoder 120 is provided an arrangement of sub-components that are an exact replica of a speech decoder, such as 130, that could be used to return the compressed speech to the PCM form. Box 390 illustrates these sub-components.

The only input to encoder 120 is the original PCM speech signal 301 sub-frame. In this embodiment of the invention, the outputs forming the compressed speech data when the speech subframe is voiced are different from when it is unvoiced. When it is determined that the speech signal is voiced, the compressed speech data includes a first set of parameters, comprising the filter coefficients 359, the pitch gain value 350, the lag value 332, the pulse codebook index 334, the pulse gain value 352, and the voiced/unvoiced control signal 362. When the speech signal is unvoiced, the compressed speech data includes a second set of parameters, comprising the filter coefficients 374, the noise codebook index 333, the noise gain value 358, and the voiced/unvoiced control signal 362.

Three codebooks are provided in the encoder 120; namely, the adaptive codebook 315, the pulse stochastic codebook 320 and the noise stochastic codebook 330. The decoder 130 must possess codebooks having the same entries as those in the encoder 120 codebooks in order to produce speech of good quality. The parameters 332, 333, 334, 350, 352, and 358 selected by the error minimization processor 327 are also fed back as control signals to codebooks 315, 320 and 330 and to gain multipliers 337, 342, and 344. The control values to the three codebooks 315, 320 and 330 and to the three gain multipliers 337, 342 and 344 are determined from a sequential process that chooses the smallest weighted error 363 between the reconstructed speech signal 365 and the original speech signal 301.

The adaptive codebook 315 is a memory space that stores at least one data element representative of the characteristics of at least a portion of a past audio signal subframe. In a specific example, the codebook 315 stores a sequence of past reconstructed speech samples of a length sufficient to include a delay corresponding to the maximum pitch lag. The number of past reconstructed speech samples may vary, but for speech sampled at 8 kHz, a codebook containing 140 samples (this is equivalent to 3.5 past reconstructed or synthesized audio signal sub-frames) is generally sufficient. In this example, each data element is associated with a past reconstructed audio signal subframe. In other words, each data element covers 40 samples. The codebook 315 may be in a buffer format that simply uses the pitch lag 332 applied to an input of the codebook as a pointer to the start of the subframe to be extracted and that appears at an output of the codebook.
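Under the buffer interpretation above, a minimal sketch of the lag-as-pointer lookup could read as follows (function and variable names are hypothetical):

```python
import numpy as np

SUBFRAME = 40   # samples per sub-frame (specific example above)
ACB_SIZE = 140  # past reconstructed samples, i.e. 3.5 sub-frames at 8 kHz

def acb_lookup(acb, lag, n=SUBFRAME):
    """Return the n samples starting `lag` samples back in the buffer of
    past reconstructed speech; repeat periodically when lag < n."""
    start = len(acb) - lag
    return np.array([acb[start + (i % lag)] for i in range(n)])
```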

The adaptive codebook 315 is updated with input 356, which is a representation of the reconstructed speech signal 354 after it has been low-pass filtered by the low-pass filter 365. The function of the low-pass filter 365 is to attenuate the high-frequency component, which manifests weaker periodicity. Input 356 is stored as the last 40-sample data element in the adaptive codebook's table 315. The oldest 40-sample data element of the adaptive codebook 315 is deleted concurrently.
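This update amounts to a first-in, first-out shift of the buffer; a sketch, continuing the hypothetical names above:

```python
import numpy as np

def acb_update(acb, filtered_subframe):
    """Append the low-pass filtered reconstructed sub-frame (input 356) as
    the newest 40-sample data element and discard the oldest 40 samples,
    keeping the buffer length fixed."""
    return np.concatenate([acb[len(filtered_subframe):], filtered_subframe])
```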

The pulse stochastic codebook 320 and the noise stochastic codebook 330 are used to derive the aperiodic component of the reconstructed speech signal 365. Both these codebooks 320 and 330 are memory devices that are fixed in time. The pulse stochastic codebook 320 stores a certain number of separately generated pulse-like entries (i.e., few non-zero pulses). The pulse-like entries may also be called “vectors”. The number of entries may vary, but in an embodiment of this invention, a pulse stochastic codebook 320 containing 512 entries has been used and works well. In this embodiment, 40 of the entries are vectors comprising only one non-zero value (i.e., one pulse), and the remaining 472 entries are vectors comprising two pulses of equal magnitude and opposite sign. The codebook vectors actually used are selected from the list of all possible such vectors by a codebook training process. The process eliminates the least frequently used vectors when coding a training set of several spoken sentences. The codebook 320 may be in a table format that simply uses the pulse codebook index 334 as a pointer to one of the vectors to be used. Upon receiving the code index 334, the pulse stochastic codebook 320 outputs the chosen table entry to multiplier 342.
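A sketch of how such candidate pulse vectors could be enumerated follows; the training pass that prunes the candidates down to the 512 retained entries is only indicated, not implemented:

```python
import itertools
import numpy as np

def pulse_candidates(n=40):
    """Candidate vectors: one unit pulse per position, plus two pulses of
    equal magnitude and opposite sign, all normalized to unit energy.
    A training process (not shown) would keep the most useful entries
    (40 one-pulse plus 472 two-pulse in the described embodiment)."""
    cands = []
    for i in range(n):
        v = np.zeros(n)
        v[i] = 1.0                                    # single non-zero pulse
        cands.append(v)
    for i, j in itertools.combinations(range(n), 2):
        v = np.zeros(n)
        v[i], v[j] = 1 / np.sqrt(2), -1 / np.sqrt(2)  # opposite-sign pair
        cands.append(v)
    return cands
```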

The noise stochastic codebook 330 stores a certain number of noise-like entries. The noise-like entries are derived from a Gaussian distribution. The noise-like vectors, which are entries to the noise stochastic codebook, are populated by outputs from a pseudo-random Gaussian noise generator whose variance is adjusted to provide unit vector energy. The number of vectors may vary, but a noise stochastic codebook 330 containing as few as 16 entries has been used and works well. The codebook 330 may be in a table format that simply uses the noise codebook index 333 as a pointer to the noise vector to be used. Upon receiving the code index 333, the noise stochastic codebook 330 outputs the chosen table entry to multiplier 344.
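A sketch of populating such a codebook (the seed and array shapes are illustrative):

```python
import numpy as np

def build_noise_codebook(entries=16, n=40, seed=0):
    """Fill noise-like vectors from a pseudo-random Gaussian generator and
    scale each one to unit energy, as described above."""
    rng = np.random.default_rng(seed)
    book = rng.standard_normal((entries, n))
    return book / np.linalg.norm(book, axis=1, keepdims=True)
```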

Two LPC synthesis filters 305 and 307 are also provided in encoder 120. Both LPC synthesis filters 305 and 307 are the inverses of quantized versions of short-term linear prediction error filters (310 and 300 respectively) minimizing, in the case of 310, the energy of the prediction residual error 357 and, in the case of 300, the energy of the prediction residual of the input signal 301. LPC synthesis filters are well-known to those skilled in the art and will not be further described here.

A low-pass filter 365 is provided in encoder 120 for enhancing the correlation between the speech subframe under analysis and past reconstructed speech subframes. In a preferred embodiment, the low-pass filter 365 is a five-tap Finite Impulse Response (FIR) filter with attenuation specified at two frequencies. Suitable values for attenuation are as follows: 4 dB at 2 kHz, and 14 dB at 4 kHz. Low-pass FIR filters are well-known to those skilled in the art and will not be further described here.
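For illustration, a sketch of applying such a filter; the tap values below are placeholders chosen only to be low-pass, not the taps that meet the stated 4 dB / 14 dB specification:

```python
import numpy as np

# Illustrative symmetric 5-tap low-pass FIR (placeholder coefficients; a
# real design would target ~4 dB attenuation at 2 kHz and ~14 dB at 4 kHz
# for 8 kHz sampling).
TAPS = np.array([0.10, 0.22, 0.36, 0.22, 0.10])

def lowpass(subframe):
    """Attenuate the weakly periodic high-frequency content of the
    reconstructed sub-frame before it enters the adaptive codebook."""
    return np.convolve(subframe, TAPS, mode="same")
```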

The voiced/unvoiced switch 360 chooses the reconstructed speech signal 365 (354 or 353) that will be sent to the adder 302 of a synthetic signal analyser that also includes the perceptual weighting filter 325 and the error minimization processor 327, based upon the voiced/unvoiced control signal 362. Control signal 362 is output from the error minimization processor 327 and is based upon its calculation of which signal (354 or 353) will result in the smallest error 363 in representing the input speech signal 301. The least mean squares method may be used to calculate the smallest error 363. In effect, control signal 362 will instruct the voiced/unvoiced switch 360 to choose the reconstructed speech signal 354 when the input speech signal 301 is voiced or, on the other hand, choose the reconstructed speech signal 353 when the input speech signal 301 is unvoiced.

The perceptual weighting filter 325 is a linear filter that attenuates those frequencies where the error is perceptually less important and that amplifies those frequencies where the error is perceptually more important. Perceptual weighting filters are very well known to those skilled in the art and will not be further described here.

The error minimization processor 327 uses the error signal output from the perceptual weighting filter 325 and, when the sequential calculation of the error signal is completed for a speech subframe, it signals the encoder 120 to send the compressed speech data producing the smallest error signal for the current speech subframe on communication channel 105. In order to achieve the smallest error for a speech subframe, the error minimization processor 327 comprises at least three subcomponents; that is, a pitch gain and lag calculator, a pulse codebook index and gain calculator, and a noise codebook index and gain calculator. It is the values output by these calculators that the encoder 120 uses to produce different error signals 363 and to determine, from these, the smallest one.

The audio signal encoder illustrated in FIG. 3 and described in detail above thus includes two signal synthesis stages, namely a voiced signal synthesis stage that produces a first synthetic audio signal and an unvoiced signal synthesis stage that produces a second synthetic audio signal. The voiced audio signal synthesis stage includes the adaptive codebook 315, the pulse stochastic codebook 320 and the LPC synthesis filter 305. The set of samples that are output from the adaptive codebook 315 and that are multiplied by the gain at the gain multiplier 337 form the periodic component of the first synthetic audio signal. The aperiodic component of the first synthetic audio signal is obtained by passing the output of the pulse stochastic codebook 320 through the LPC synthesis filter 305 that receives the filter coefficients computed for the current sub-frame from the LPC analysis and quantizer block 310. The adder sums the periodic and the aperiodic components as output by the gain multiplier 337 and the LPC synthesis filter 305, respectively, to generate the first synthetic audio signal sub-frame.

The unvoiced signal synthesis stage includes the noise stochastic codebook 330 and the LPC synthesis filter 307. The latter receives the quantized filter coefficients 374 for the current sub-frame and processes the output of the noise stochastic codebook 330 to generate the second synthetic audio signal sub-frame. The two synthetic audio signal sub-frames are then applied to the switch 360 that selects one of the signals and passes the signal to the synthetic signal analyzer.

An example of a basic sequential algorithm used to calculate the smallest value of the error signal follows. First, set the switch 360 to the voiced position such that the voiced synthetic signal will be applied to the synthetic signal analyser. Second, calculate the value of the error signal using a set of lag values 332 in the ACB 315 and the gain values in the multiplier 337, storing the values of the error signal in a memory space. From the values of the error signal for the ACB 315 alone, choose the smallest one and, with the lag value 332 and gain value 350 used to obtain this result, calculate new error values using the index values 334 that are input to the pulse stochastic codebook 320 and the gain values that are input to the multiplier 342. If the error signal is sufficiently reduced, declare the subframe “voiced”, leave the switch 360 in the voiced position, and send the various indices and values used to obtain the smallest error signal for this “voiced” subframe on the communication link 105. If, on the other hand, it is not possible to achieve a sufficiently small error signal using the pulse stochastic codebook 320, the subframe is declared “unvoiced”, the switch 360 is set to the unvoiced position, and a third set of error values is calculated using the index values 333 that are input to the noise stochastic codebook 330 and the gain values 358 that are input to the multiplier 344. The various indices and values used to obtain the smallest error signal for this “unvoiced” subframe are sent on the communication link 105. The error minimization processor 327 also calculates the control signal 362, which was described earlier. Error minimization processors are very well-known to those skilled in the art and will not be further described here.
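The following Python sketch captures the shape of this sequential search under simplifying assumptions: codebook entries are compared directly against the (residual) signal with a least-squares gain, whereas the actual encoder evaluates them through the LPC synthesis and perceptual weighting filters. All names are hypothetical:

```python
import numpy as np

def best_gain_and_error(target, vec):
    """Least-squares gain for `vec` against `target`, plus residual error."""
    g = float(np.dot(target, vec) / (np.dot(vec, vec) + 1e-12))
    return g, float(np.sum((target - g * vec) ** 2))

def search_subframe(subframe, acb_entries, pulse_cb, noise_cb, threshold):
    """acb_entries: dict mapping pitch lag -> candidate ACB vector."""
    # Step 1: best adaptive codebook entry (pitch lag and gain).
    lag, g_p, _ = min(((L, *best_gain_and_error(subframe, v))
                       for L, v in acb_entries.items()), key=lambda t: t[2])
    residual = subframe - g_p * acb_entries[lag]
    # Step 2: best pulse codebook entry against the remaining residual.
    idx, g_pl, err = min(((i, *best_gain_and_error(residual, v))
                          for i, v in enumerate(pulse_cb)), key=lambda t: t[2])
    if err <= threshold:          # error sufficiently reduced: "voiced"
        return {"voiced": True, "lag": lag, "pitch_gain": g_p,
                "pulse_index": idx, "pulse_gain": g_pl}
    # Step 3: fall back to the noise codebook: "unvoiced".
    idx, g_n, _ = min(((i, *best_gain_and_error(subframe, v))
                       for i, v in enumerate(noise_cb)), key=lambda t: t[2])
    return {"voiced": False, "noise_index": idx, "noise_gain": g_n}
```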

The following paragraphs describe the flow and evolution of the various signals in an encoder 120. An input speech signal 301 is first fed to the LPC analysis block 300, to adder 306 and to adder 302. The LPC analysis block 300 produces LPC filter coefficients 304 that are fed to the perceptual weighting filter 325 and to the LPC quantizer 370. The quantized versions of the filter coefficients 374 are fed to the LPC synthesis filter 307. The quantized LPC filter coefficients are also sent to the communication channel 105 upon calculation of the best parameters to represent the speech signal subframe being considered.

At adder 302, the error signal 363 is calculated as the result of the subtraction of the reconstructed speech signal 365 (354 or 353) from the input speech signal 301. This error signal 363 is fed to the perceptual weighting filter 325. Based on the LPC coefficients 304, the perceptual weighting filter 325 modifies the spectrum of the error signal for best masking of the current speech subframe before calculating the error energy. This modified error signal is forwarded to the error minimization processor 327 that calculates, through a closed-loop analysis, the compressed speech outputs that will best represent the input speech signal 301. When it is determined that the speech signal is voiced, the compressed speech data includes the quantized filter coefficients 359, the pitch gain value 350, the lag value 332, the pulse codebook index 334, the pulse gain value 352, and the voiced/unvoiced control signal 362. When it is determined that the speech signal is unvoiced, the compressed speech data includes the quantized filter coefficients 374, the noise codebook index 333, the noise gain value 358, and the voiced/unvoiced control signal 362. The error minimization processor 327 also calculates the control signal 362.

The lag value 332 is fed back to the adaptive codebook 315. It acts as a pointer to determine, from the adaptive codebook 315, the start of the speech subframe that will be output to multiplier 337. The pitch gain value 350 is fed back directly to multiplier 337. The multiplier 337 uses the pitch gain 350 and the output of the adaptive codebook 315 to produce a pitch prediction signal 355. The pitch prediction signal 355 is fed to adders 306 and 312.

At adder 306, the pitch prediction signal 355 is subtracted from the input speech signal 301 to produce the pitch prediction residual 357. Having removed the periodic component (i.e., the pitch prediction signal 355) from the input speech signal 301, what remains is an aperiodic signal (i.e., the pitch prediction residual 357). The pitch prediction residual 357 is fed to the LPC analysis and quantization block 310 (similar to block 300 discussed earlier) that produces LPC coefficients 359. These coefficients 359 are further fed to the LPC synthesis filter 305.

The pulse codebook index 334 is fed back to the pulse stochastic codebook 320. It acts as a pointer to determine, from the stochastic codebook 320, which pulse-like vector will be output to multiplier 342. The pulse gain value 352 is fed back directly to multiplier 342. The multiplier 342 uses the pulse gain value 352 and the output of the pulse stochastic codebook 320 to produce an excitation signal 351. The excitation signal 351 is fed to the LPC synthesis filter 305. With the LPC coefficients 359, the LPC synthesis filter 305 produces the aperiodic component 364 of a voiced speech signal. This aperiodic component 364 is added to the periodic component 355 to produce the reconstructed speech signal 354. The reconstructed speech signal 354 is returned to the adaptive codebook through a feedback loop and is also fed to the voiced/unvoiced switch 360.

The noise codebook index 333 is fed back to the noise stochastic codebook 330. It acts as a pointer to determine, from the noise stochastic codebook 330, which noise-like vector will be output to multiplier 344. The noise gain value 358 is fed back directly to multiplier 344. The multiplier 344 uses the noise gain value 358 and the output of the noise stochastic codebook 330 to produce an excitation signal 361. The excitation signal 361 is fed to the LPC synthesis filter 307. With the quantized LPC coefficients 374, the LPC synthesis filter 307 produces a reconstructed speech signal 353. The reconstructed speech signal 353 is fed to the voiced/unvoiced switch 360.

The voiced/unvoiced switch 360 simply acts upon the input 362 that determines if the current speech subframe is voiced or unvoiced. If the subframe is voiced, switch 360 passes on signal 354 to adder 302, and if the subframe is unvoiced, signal 353 is passed on to adder 302. Both signals (353 and 354) are called signal 365 after switch 360.

The mathematical representation of a voiced speech signal for the novel CELP encoder described in FIG. 3 is given by:

i(n) = g_p a(n−L) ⊗ h_f(n) + g_pl b(n) ⊗ h_r(n) + e(n)

where i(n), n = 1, . . . , N is the input sequence to be approximated;

a(n−L) is the ACB sequence selected;

h_f(n) is the impulse response of a fixed low-pass filter;

g_p is the pitch gain parameter adjusted to maximize the pitch prediction gain;

b(n) is a sparse impulse sequence (unit energy) taken from the SCB;

h_r(n) is the impulse response of an all-pole LPC synthesis filter derived from the pitch residual;

g_pl is a pulse gain parameter;

e(n) is an error sequence to be minimized (after perceptual weighting); and

⊗ represents discrete convolution.
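Compared with the FIG. 2 model, the ACB term is now shaped by the fixed low-pass response h_f(n) and the pulse term by h_r(n). A minimal sketch of forming a voiced sub-frame under this model (the impulse responses are supplied as truncated arrays; names are hypothetical):

```python
import numpy as np

def voiced_synthesize(acb_seq, g_p, pulse_vec, g_pl, h_f, h_r, n=40):
    """Sketch of i(n) ~ g_p*a(n-L) (x) h_f(n) + g_pl*b(n) (x) h_r(n),
    with h_f and h_r given as (truncated) impulse responses."""
    periodic = g_p * np.convolve(acb_seq, h_f)[:n]      # low-pass ACB path
    aperiodic = g_pl * np.convolve(pulse_vec, h_r)[:n]  # pulse path via LPC
    return periodic + aperiodic
```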

The above description of the invention refers to the structure and operation of the encoder of the audio signal. In a practical system the encoding operation normally takes place at the source of the audio signal, such as in a telephone set. The audio signal in encoded or compressed form is transmitted to a remote location where it is decoded. In the encoded form the audio signal includes the filter coefficients and the excitation segment. At the remote location these two elements, namely the filter coefficients and the excitation segment, are processed by the decoder to generate a synthetic audio signal. The decoder has not been described in detail because its structure and operation are very similar to those of the audio signal encoder. With reference to FIG. 3, the structure of the audio signal decoder is identical to the components identified by the box 390 shown in dotted lines. The decoder receives, for each sub-frame, the filter coefficients and the excitation segment and issues a synthesized audio signal sub-frame. Note that each set of parameters for a given sub-frame carries an indication as to the nature of the set (either voiced or unvoiced). The indication can be a single bit, the value 0 representing a set of parameters for an unvoiced signal while the value 1 represents a set of parameters for a voiced signal. This bit is used to set the voiced/unvoiced switch to the proper position so the set of parameters can be transmitted to the proper synthesis stage.

The apparatus illustrated in FIG. 4 can be used to implement the function of the encoder 120 whose operation is detailed above in connection with FIG. 3. The apparatus 500 comprises an input signal line 100, an output signal line 105, a processor 514 and a memory 516. The memory 516 is used for storing instructions for the operation of the processor 514 and also for storing the data used by the processor 514 in executing those instructions. A bus 518 is provided for the exchange of information between the memory 516 and the processor 514. The instructions stored in the memory 516 allow the apparatus to implement the functional blocks depicted in the diagram of FIG. 3. Those functional blocks can be viewed as individual program elements or modules that process the data at one of the inputs and issue processed data at the appropriate output.

Under this mode of construction, the encoder and decoder units are actually program elements that are invoked when an encoding/decoding operation is to be performed. Other forms of implementation are possible. The encoder unit 120 may be formed by individual circuits, such as microcircuits hardwired on a chip.

In prior art audio signal vocoders, during speech processing operations, it is common practice to smooth out speech sample parameters across each speech frame. An example of a parameter that is smoothed is the amplitude of a speech sample. A frame typically comprises a small number of sub-frames, such as four sub-frames. A common smoothing method is to calculate the average slope for a given sub-frame of speech samples and to send averaged sample values, corresponding to the calculated slope, to the next speech processing operation. In fact, a more convenient method is to send only the slope and the period for which this slope is valid instead of the actual sample values.
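As an illustration of the slope-based method, a least-squares line fit gives the slope and the averaged values it implies (a hypothetical helper; real systems may compute the slope differently):

```python
import numpy as np

def slope_smooth(values):
    """Fit a line to a sub-frame's parameter values; return the slope and
    the linearized values, so only the slope and its validity period need
    be forwarded instead of every value."""
    n = np.arange(len(values))
    slope, intercept = np.polyfit(n, values, 1)  # degree-1 least squares
    return slope, intercept + slope * n
```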

An inherent problem in this smoothing operation is that it changes the “real” characteristics of a speech signal. This problem is exacerbated when a given frame of speech samples includes voiced and unvoiced sub-frames. The result is that the slope calculation discussed above is erroneous since the spectrum for voiced and unvoiced speech is quite different. In many cases this has no severe negative consequences since the resulting speech degradation is acceptable for a high bit rate. However, when encoding at low bit rates, the traditional smoothing method may significantly degrade the audio quality.

A novel method for smoothing parameters across speech frames is described below. This method has two different embodiments. In a first preferred embodiment, the speech sub-frames are classified as voiced or unvoiced. Classifying sub-frames into voiced and unvoiced categories is well known in the art to which this invention pertains. In a specific example, the voiced/unvoiced classification is based on information regarding the selected signal subframe including the relative subframe energy, the ACB gain, and the error reduction by means of the best entry from the pulse stochastic codebook. Once the speech subframes are identified as voiced or unvoiced, a smoothing operation is performed by smoothing the voiced and unvoiced subframes separately within a frame. In other words, smoothing is applied to sub-frames within a given frame having the same classification. In a specific example, smoothing of the gain values and the LPC filter coefficients is performed. Smoothing algorithms are well known in the art to which this invention pertains, and the smoothing of parameters other than the ones mentioned above does not detract from the spirit of the invention provided the smoothing is applied separately on voiced and unvoiced speech sub-frames.
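A sketch of this class-separated smoothing follows; the `smooth` routine stands in for any prior art smoothing algorithm applied to gains or LPC coefficients:

```python
def smooth_frame(subframe_params, classes, smooth):
    """Smooth voiced and unvoiced sub-frames separately within one frame.

    subframe_params -- list of per-sub-frame parameter vectors
    classes         -- parallel list of "voiced"/"unvoiced" labels
    smooth          -- smoothing function applied to a list of vectors
    """
    out = list(subframe_params)
    for cls in ("voiced", "unvoiced"):
        idx = [i for i, c in enumerate(classes) if c == cls]
        for i, sf in zip(idx, smooth([subframe_params[i] for i in idx])):
            out[i] = sf
    return out  # sub-frames re-assembled in their original order
```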

An apparatus for smoothing audio signal frames in accordance with this embodiment is depicted in FIG. 5. At the input of the apparatus is supplied an audio signal frame to be processed. The frame has four sub-frames, there being three voiced sub-frames and one unvoiced sub-frame. A voiced/unvoiced classifier 600 processes the sub-frames individually to determine whether they fall in the voiced or unvoiced category by any one of the prior art methods mentioned earlier. The sub-frames that are declared voiced are directed to a smoothing block 602 (that operates according to prior art methods), while the sub-frames that are declared unvoiced are directed to a smoothing block 604. Both smoothing blocks can be identical or use different algorithms. The smoothed sub-frames are then re-assembled in their original order to form the smoothed audio signal frame.

In a second embodiment, illustrated in FIG. 6, an unvoiced/voiced classifier examines each frame that arrives at its input. A re-classification block will change the class of a given sub-frame according to a selected heuristics model to avoid multiple voiced-to-unvoiced transitions and vice-versa. The heuristics model may be such as to change the classification of a certain sub-frame when that sub-frame is surrounded by sub-frames of a different class. For example, the frame voiced|voiced|unvoiced|voiced, when processed by the re-classifier 702, will become voiced|voiced|voiced|voiced. Smoothing is then separately performed on the resulting sub-frames in a similar manner as described above. More specifically, isolated voiced or unvoiced sub-frames are reclassified so that only one voiced-to-unvoiced or unvoiced-to-voiced change is retained in any one frame.
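A sketch of such a re-classification pass over the sub-frame labels of a frame (one simple heuristic; the actual model may differ):

```python
def reclassify(classes):
    """Flip the class of a sub-frame surrounded by sub-frames of the other
    class, e.g. V|V|U|V -> V|V|V|V, so that at most one voiced/unvoiced
    transition remains in the frame."""
    out = list(classes)
    for i in range(1, len(out) - 1):
        if out[i - 1] == out[i + 1] != out[i]:  # isolated sub-frame
            out[i] = out[i - 1]
    return out
```

For the example above, reclassify(["voiced", "voiced", "unvoiced", "voiced"]) returns four voiced labels.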

The apparatus depicted in FIGS. 5 and 6 can be implemented on any suitable computing platform of the type illustrated in FIG. 4.

The above description of a preferred embodiment of the present invention should not be read in a limitative manner as refinements and variations are possible without departing from the spirit of the invention. The scope of the invention is defined in the appended claims and their equivalents.

I claim:
1. An audio signal encoding device comprising: a) an input for receiving a sub-frame of an audio signal to be encoded; b) an adaptive codebook in which is stored at least one prior knowledge entry, said prior knowledge entry including a data element representative of characteristics of at least a portion of a previously synthesised audio signal sub-frame; c) a processing unit in operative relationship with said input and with said adaptive codebook, said processing unit being operative for synthesising a set of parameters to generate a synthesised audio signal sub-frame on a basis of at least: i. the sub-frame of an audio signal received at said input; ii. the data element in said adaptive codebook.
2. An audio signal encoding device as defined in claim 1, wherein said data element is representative of characteristics of at least one previously synthesised audio signal sub-frame.
3. An audio signal encoding device as defined in claim 2, wherein said adaptive codebook stores a plurality of prior knowledge entries, each prior knowledge entry including a data element representative of characteristics of at least one previously synthesised audio signal sub-frame.
4. An audio signal encoding device as defined in claim 3, wherein each prior knowledge entry includes a set of samples from a previously synthesised audio signal sub-frame.
5. An audio signal encoding device as defined in claim 2, wherein each prior knowledge entry is a set of samples of a previously synthesised audio signal sub-frame associated to the audio signal received at said input.
6. An audio signal encoding device as defined in claim 5, wherein said adaptive codebook includes: (a) an adaptive codebook input; (b) an adaptive codebook output; said adaptive codebook, in response to receiving at said adaptive codebook input a parameter indicative of a selected one of the data elements in the codebook, releasing at said adaptive codebook output samples associated with the previously synthesised audio signal sub-frame corresponding to said selected one of the data elements.
7. An audio signal encoding device as defined in claim 6, said audio signal encoding device comprising a gain multiplier coupled to said adaptive codebook output to multiply the samples associated with a previously synthesised audio signal sub-frame at said adaptive codebook output by a certain gain value to provide a periodic component of the synthesised audio signal.
8. An audio signal decoding device for synthesising a certain audio signal sub-frame from a set of parameters derived from an original audio signal sub-frame, said audio signal decoding device comprising: i. an input for receiving the set of parameters derived from the original audio signal sub-frame; ii. an adaptive codebook in which is stored at least one prior knowledge entry, said prior knowledge entry including a data element representative of characteristics of at least a portion of an audio signal sub-frame previously synthesised by said audio signal decoding device; iii. a processing unit in operative relationship with said input and with said adaptive codebook, said processing unit being operative for synthesising the certain audio signal sub-frame on a basis of at least: (a) the set of parameters received at said input; (b) the data element in said adaptive codebook.
9. An audio signal decoding device as defined in claim 8, wherein said adaptive codebook includes a plurality of prior knowledge entries, each prior knowledge entry including a data element representative of characteristics of at least one previously synthesised audio signal sub-frame.
10. An audio signal decoding device as defined in claim 9, wherein each prior knowledge entry includes a set of samples from at least one previously synthesised audio signal sub-frame.
11. A method for synthesising a certain audio signal sub-frame from a set of parameters derived from an original audio signal sub-frame, said method comprising the steps of: a) receiving the set of parameters derived from the original audio signal sub-frame; b) providing an adaptive codebook in which is stored at least one prior knowledge entry, said prior knowledge entry including a data element representative of characteristics of at least a portion of an audio signal sub-frame previously synthesised by an audio signal decoding device; c) synthesising the certain audio signal sub-frame on a basis of at least: i. the set of parameters received; ii. the data element in said adaptive codebook.
12. A method as defined in claim 11, wherein said adaptive codebook includes a plurality of prior knowledge entries, each prior knowledge entry including a data element representative of characteristics of at least one previously synthesised audio signal sub-frame.
13. A method as defined in claim 12, wherein each prior knowledge entry includes a set of samples from at least one previously synthesised audio signal sub-frame.